Abstract



Providing information to test takers and test score users about the abilities of test takers 
at different score levels has been a persistent problem in educational and psychological 
measurement (Carroll, 1993). Scale anchoring (Beaton & Allen, 1992), a technique that 
describes what students at different points on a score scale know and can do, is a tool to 
provide such information. Scale anchoring for a test involves a substantial amount of work,
both by the statistical analysts and by the test developers involved with the test. In addition,
scale anchoring involves considerable use of subjective judgment, so its conclusions may be 
questionable. This paper describes statistical procedures that can be used to determine if 
scale anchoring is likely to be successful for a test. If these procedures indicate that scale 
anchoring is unlikely to be successful, then there is little reason to perform a detailed scale 
anchoring study. The procedures are applied to several data sets from a teacher licensing 
test. 

Key words: augmented subscore, distinctness, mean-squared error, proportional reduction 
in mean-squared error, reliability 
Testing companies are under constant pressure to produce information in addition
to the overall test score. Subscores (e.g., Sinharay, Haberman, & Puhan, 2007) are one 
potential source of such additional information. Another source is information for test 
takers and test score users concerning the type of tasks examinees at specified score levels 
are typically able to perform. Although such information might appear to be readily 
supplied, in practice the task has been a persistent problem in educational and psychological 
measurement (Carroll, 1993). Testing companies have been investigating solutions to this 
problem through the development of proficiency scaling procedures and question-difficulty 
research. Scale anchoring (Beaton & Allen, 1992), which results in descriptions of what 
students at different points on a score scale know and can do, is a tool to provide such 
information concerning the relationship between the tasks examinees can perform and observed
test scores. For example, a scale anchoring study for the TOEFL iBT Reading section 
(Garcia Gomez, Noah, Schedl, Wright, & Yolkut, 2007) found, among other things, that the 
test-takers who obtain a high score (22-30) in TOEFL iBT Reading typically have a very 
good command of academic vocabulary and grammatical structure. Scale anchoring has 
been used with a variety of assessments, including the National Assessment of Educational 
Progress (NAEP; Beaton & Allen, 1992) and the Trends in International Mathematics 
and Science Study (TIMSS; Kelley, 2002). The procedure of scale anchoring produces 
performance-level descriptors or PLDs (Perie, 2008), which describe the level of knowledge 
and skills required of different performance levels. 

The U.S. Government’s No Child Left Behind (NCLB) Act of 2001 demands, among
other things, that students should receive diagnostic reports that allow teachers to address 
their specific academic needs; scale anchoring could be used in such a diagnostic report. 
Some researchers (e.g., Sinharay & Haberman, 2008) recommended consideration of scale 
anchoring for tests that are under pressure to report additional information, but do not 
have high-quality subscores. 

Nonetheless, scale anchoring is not without problems. Linn and Dunbar (1992) 
described the confusion of the general public about the meaning of NAEP data related to 
score anchors. They concluded that the reasons for the discrepancy between the percentage 
of examinees who answer an anchor item correctly and the percentage who score above
the corresponding anchor point may be too subtle for mass communication. Phillips et al. 
(1993) described the potential danger of overinterpreting examinee performance at anchor 
points so that all examinees at a particular level are assumed to be proficient at all abilities 
measured at that level. 

The steps required in scale anchoring are the following: 

1. Select a few carefully dispersed points on the score scale (anchor points) that will be
anchored.

2. Find examinees who score near each anchor point. 

3. Examine each item to see if it discriminates between successive anchor points, that
is, if most (more than 50%) of the students at the higher score level can answer it
correctly and fewer than 50% of the students at the lower level can.

4. Review the items that discriminate between adjacent anchor points to find out if
specific tasks or attributes that they include can be generalized to describe the level of
proficiency at the anchor point. What students at various scale points know and can
do can be summarized this way.

The above description shows that scale anchoring involves a statistical component 
(the first three steps) that identifies items that discriminate between successive points on 
the proficiency scale using specific item attributes (Beaton & Allen, 1992). These steps 
are closely related to the common process of item mapping. The fourth step involves 
generalizations not required in item mapping. Scale anchoring involves a consensus 
component in which identified items are used by subject-area and educational experts to 
provide an interpretation of what groups of students at or close to the selected scale points 
know and can do. This consensus component can be costly (because of the involvement of 
subject-area and educational experts) and can be quite time-consuming. In addition, the 
subjective judgment involved may not be reliable. 
As Beaton and Allen (1992) noted, the scale anchoring process is not guaranteed
to result in useful descriptions of the anchor points. A test that is well-designed for its 
intended purpose may not have sufficient information available to differentiate between 
performance of examinees at given score levels on items with different attributes. In some 
cases, this failure may simply reflect lack of a sufficient number of items anchoring at given 
score levels. It may also be true that the items at an anchor level are too dissimilar to 
interpret. 

Therefore, before performing an exhaustive scale anchoring study, it may be
beneficial to perform a set of simple statistical analyses to find out whether scale
anchoring will provide useful results. This paper suggests such a set of analyses, which
include simple regression analysis and fitting of several popular item response theory (IRT)
models. The next section discusses our suggested set of techniques and describes why they
are appropriate. The techniques are applied to several data sets from a teacher licensing 
test in the application section. Conclusions and recommendations are provided in the last 
section. 



1 Methods to Predict Success of Scale Anchoring 

The description of scale anchoring given in the previous section indicates that scale
anchoring can only succeed (which means that it can provide useful information to the 
examinees) if, for each pair of successive anchor points (which correspond to a small range 
of ability or difficulty level of items, for example, a range of proportion correct of 0.60 
to 0.75), there are items with specific attributes that most students at the lower point 
cannot answer but most students at the higher point can, that is, the items are highly 
discriminating at specific levels of difficulty. Thus scale anchoring can only succeed if item 
attributes can predict item difficulties to an adequate degree and if item discriminations 
associated with these item attributes are high. If item attributes do not predict item 
difficulties well, then the items discriminating between adjacent anchor points will not be 
readily interpreted in terms of item attributes. Unless item discriminations are consistently 
high, it is also necessary for item attributes to predict item discrimination. Otherwise, in Step 4 of
the description of scale anchoring, the required generalizations will not be feasible. Hence,
the key to our suggested techniques is an examination of how well item attributes predict 
item difficulties and item discriminations. A mathematical proof of why it is necessary and 
sufficient for the item attributes to predict item difficulties and item discriminations for the 
success of scale anchoring is given towards the end of this section. 

The techniques that will be suggested here assume availability of test data concerning 
item attributes along with variables used in test development to characterize items in 
terms of features such as domain covered or type of tasks covered. Such variables are 
usually available because test developers use them to create test forms that conform to 
specifications. A common problem will be that the test design is likely not to be optimal 
for the purpose of inferences concerning item attributes. This issue will receive further 
attention in the concluding section. 

The first technique that can be used is simple linear regression of item statistics 
(item difficulty or item discrimination) on indicators of appropriate item attributes (see 
Sheehan & Mislevy, 1994, for examples of such analyses). The squared multiple correlations 
from these regressions will provide an idea of how well the item statistics can be predicted 
by the item attributes. 

The second set of techniques involves fitting of several item response theory (IRT)
models to the data. In the operational data examples considered later, all items are
right-scored and $n$ examinees respond to $m$ items. Associated with Item $i$ are item
attributes $q_{ik}$, $1 \le k \le K$, for some integer $K \ge 1$. The $q_{ik}$ are indicator variables,
with $q_{ik} = 1$ if Attribute $k$ is present for Item $i$ and $q_{ik} = 0$ otherwise. The response of
Examinee $s$, $1 \le s \le n$, to Item $i$ is $X_{is}$, and the latent proficiency parameter of Examinee $s$
is a random variable $\theta_s$ with a standard normal distribution. Conditional on $\theta_s$, the $X_{is}$
are mutually independent and the probability that $X_{is} = 1$ is $p_{is}$. The logit of $p_{is}$ is
$\lambda_{is} = \log[p_{is}/(1 - p_{is})]$, so that

$$p_{is} = \frac{\exp(\lambda_{is})}{1 + \exp(\lambda_{is})}.$$

All models considered are special cases of the two-parameter logistic (2PL) model $M_2$ in
which

$$\lambda_{is} = a_i \theta_s - \beta_i \qquad (1)$$

for the discrimination $a_i$ and the intercept $\beta_i$ of Item $i$. If $a_i > 0$, then $b_i = \beta_i/a_i$ is the
difficulty of Item $i$. In the one-parameter logistic (1PL) model $M_1$, also known as the Rasch
model, it is assumed that the discrimination parameter $a_i$ is the same for all items. In
the zero-parameter logistic (0PL) model $M_0$, it is assumed that both the discrimination
parameter $a_i$ and the intercept parameter $\beta_i$ are the same for all items. In the independence
model $M_I$ (Haberman, 2006), it is assumed that the item discrimination parameter $a_i$ is 0
for all items, so that the $X_{is}$ are mutually independent and $p_{is}$ does not depend on $\theta_s$.
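
As a minimal sketch of how these nested models relate, the following Python functions evaluate the response probability from a logit of the form a·θ − β and the difficulty b = β/a. The function names and the parameter values are illustrative assumptions and are not taken from the operational data.

```python
import math

def logit_2pl(theta, a, beta):
    """Logit of a correct response under the 2PL parameterization: a * theta - beta."""
    return a * theta - beta

def prob_correct(theta, a, beta):
    """Probability of a correct response, exp(lam) / (1 + exp(lam))."""
    lam = logit_2pl(theta, a, beta)
    return 1.0 / (1.0 + math.exp(-lam))

def difficulty(a, beta):
    """Item difficulty b = beta / a, defined only for a positive discrimination."""
    if a <= 0:
        raise ValueError("difficulty requires a positive discrimination")
    return beta / a

# Hypothetical item parameters (illustrative values only).
a_i, beta_i = 1.2, 0.6
b_i = difficulty(a_i, beta_i)
# At theta equal to the difficulty, the logit is zero.
p_at_b = prob_correct(b_i, a_i, beta_i)

# Independence model: a = 0, so the probability does not depend on theta.
p_indep_low = prob_correct(-2.0, 0.0, beta_i)
p_indep_high = prob_correct(2.0, 0.0, beta_i)
```

Setting the same `a` for every item gives the Rasch case, and setting the same `a` and `beta` for every item gives the 0PL case.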

Models $M_0$, $M_1$, $M_2$, and $M_I$ do not use the indicator variables $q_{ik}$. In several
models, these indicators are employed to predict item parameters. In the linear logistic test
model (LLTM; Fischer, 1973) $M_L$, the Rasch model $M_1$ is assumed, and it is assumed that
the item intercept satisfies a linear model

$$\beta_i = \eta_1 q_{i1} + \eta_2 q_{i2} + \cdots + \eta_K q_{iK} = \sum_{k=1}^{K} \eta_k q_{ik} \qquad (2)$$

in which $\eta_k$ represents the effect of Attribute $k$ on the intercept $\beta_i$ of Item $i$. Model $M_L$
reduces to the 0PL model $M_0$ if $K = 1$ and $q_{i1} = 1$ for all Items $i$. Model $M_L$ is the same as
the Rasch model $M_1$ if $K = m$ and the $m$ by $K$ matrix $Q$ of $q_{ik}$ has rank $m$. The linear
logistic test model has two generalizations to 2PL models. In the constrained 2PL model
(Embretson, 1993) $M_C$, it is assumed that the difficulty

$$b_i = \sum_{k} \gamma_k q_{ik} \qquad (3)$$

of Item $i$ satisfies a linear model in which $\gamma_k$ represents the effect of Attribute $k$ and the
item discrimination satisfies a linear model

$$a_i = \sum_{k} \tau_k q_{ik} \qquad (4)$$

in which $\tau_k$ represents the effect of Attribute $k$. If $q_{i1} = 1$ for all $i$ and $K = 1$, then the
constrained 2PL model reduces to the 0PL model $M_0$. In the alternative constrained 2PL
model $M_A$, (2) is assumed to hold for some $\eta_k$ and (4) is assumed to hold for some $\tau_k$.
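
The linear models for item parameters can be illustrated with a short sketch in which each item parameter is a sum of attribute effects weighted by the item's 0/1 indicators. The indicator matrix, the effect values, and the function name below are hypothetical and chosen only to show the mechanics.

```python
def linear_predictor(q_row, effects):
    """Sum over attributes k of effects[k] * q_row[k] for a single item."""
    if len(q_row) != len(effects):
        raise ValueError("one effect is needed per attribute")
    return sum(e * q for e, q in zip(effects, q_row))

# Hypothetical indicator matrix: 4 items, K = 3 attributes.
# The first column is the constant indicator (1 for every item).
Q = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]

# LLTM: attribute effects on the item intercepts (assumed values).
eta = [-0.5, 0.8, 0.4]
lltm_intercepts = [linear_predictor(row, eta) for row in Q]

# Constrained 2PL: attribute effects on difficulty and on discrimination (assumed values).
gamma = [-0.2, 0.6, 0.3]
tau = [1.0, 0.2, -0.1]
c2pl_difficulties = [linear_predictor(row, gamma) for row in Q]
c2pl_discriminations = [linear_predictor(row, tau) for row in Q]

# With K = 1 and only the constant indicator, every item gets the same intercept,
# which is the 0PL restriction on the intercepts.
opl_intercepts = [linear_predictor([1], [-0.5]) for _ in Q]
```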
To compare models, the information-theoretic measure minimum estimated expected
log penalty per item (MEELPPI; see, e.g., Gilula & Haberman, 2001; Haberman, 2006)
may be employed. For Model $M_x$, the MEELPPI is obtained as $H_x = -\ell_x/(2nm)$, where
$\ell_x$ is the maximum log-likelihood under the model. For example, $H_1$ is the MEELPPI for
Model $M_1$ and $H_A$ is the MEELPPI under Model $M_A$. Among the models under study,

$$H_0 \ge H_L \ge H_1 \ge H_2, \quad H_0 \ge H_L \ge H_C \ge H_2, \quad H_0 \ge H_L \ge H_A \ge H_2, \quad \text{and} \quad H_I \ge H_1 \ge H_2$$

because, for example, $M_0$ is a special case of $M_L$ and $M_1$ is a special case of $M_2$. An LLTM
model is most attractive if $H_L$ is close to $H_A$, $H_C$, $H_1$, and $H_2$. The constrained 2PL
model is most attractive if $H_C$ is close to $H_2$, and the alternate constrained 2PL model is
most attractive if $H_A$ is close to $H_2$. Evaluation of closeness can be considered in terms of
relative reduction of MEELPPI and in terms of reductions of MEELPPI per independent
parameter. Let $M_x$ have $d_x$ independent parameters, so that $d_0 = 2$, $d_I = m$, $d_1 = m + 1$,
$d_2 = 2m$, $d_L = 1 + K$, and $d_C = d_A = 2K$. If Model $M_x$ implies Model $M_y$ but the models
are not equivalent, then the improvement in MEELPPI per independent parameter is

$$\nu_{xy} = \frac{H_x - H_y}{d_y - d_x}.$$

Larger values of $\nu_{xy}$ are favorable for Model $M_y$. If Model $M_x$ implies Model $M_y$, and
Model $M_y$ implies Model $M_z$ and the models are not equivalent, then one may examine the
relative improvement

$$R^2_{xyz} = \frac{H_x - H_y}{H_x - H_z}$$

to know how $M_y$ compares to Model $M_z$, where $M_x$ provides a baseline for comparison
of $M_y$ and $M_z$. Values of $R^2_{xyz}$ near 1 are desirable. It is certainly desired that $R^2_{xyz}$ be
somewhat larger than $(d_y - d_x)/(d_z - d_x)$, so that the gain per independent parameter from
Model $M_x$ to Model $M_y$ is somewhat larger than is the gain per independent parameter
from Model $M_y$ to Model $M_z$.

For example, consider an evaluation of the linear logistic test model (Model $M_L$).
Consider a comparison to the Rasch model (Model $M_1$) where the 0PL model (Model $M_0$)
provides a baseline for comparison. Assume that $1 < K < m$. It is desirable that

$$\nu_{0L} = (K - 1)^{-1}(H_0 - H_L)$$

be somewhat larger than is

$$\nu_{L1} = (m - K)^{-1}(H_L - H_1).$$

It is also desirable that

$$R^2_{0L1} = \frac{H_0 - H_L}{H_0 - H_1}$$

be somewhat larger than $(K - 1)/(m - 1)$. Favorable results suggest some ability to predict
item intercept by use of item attributes. In the Rasch case, the ability to predict item
intercept is equivalent to the ability to predict item difficulty.

Similar arguments can be applied to the constrained 2PL model $M_C$ or the alternate
constrained 2PL model $M_A$. In the case of $M_C$, it is important to examine whether

$$\nu_{0C} = [2(K - 1)]^{-1}(H_0 - H_C)$$

is somewhat larger than

$$\nu_{C2} = [2(m - K)]^{-1}(H_C - H_2).$$

It is also desirable that

$$R^2_{0C2} = \frac{H_0 - H_C}{H_0 - H_2}$$

be somewhat larger than $(K - 1)/(m - 1)$. Favorable results suggest some ability to predict
item difficulty and item discrimination from item attributes.

In principle, it is possible to apply chi-square tests to compare models. Let Model
$M_x$ imply Model $M_y$, and let Models $M_x$ and $M_y$ not be equivalent. If Model $M_x$ holds,
then the likelihood-ratio chi-square statistic $L^2_{xy} = 2nm(H_x - H_y)$ has an approximate
chi-square distribution on $d_y - d_x$ degrees of freedom. In large samples, $L^2_{xy}$ will be quite
large even if the deviation of Model $M_x$ from the data is small, so that this approach is not
very helpful in practice. In all cases in this report, $L^2_{xy}$ is highly significant.

To discuss the relationship of model parameters in the 2PL model, let us consider
an item that anchors at $\theta_s = u$. Then, from the earlier description of scale anchoring,
the probability of a correct response is at least $p$ for $\theta_s = u$ and no more than $q < p$ for
$\theta_s = v < u$. That means

$$a_i u - \beta_i \ge \log[p/(1 - p)]$$

and

$$a_i v - \beta_i \le \log[q/(1 - q)].$$

The above inequalities imply that the discrimination parameter $a_i$ must be at least

$$\frac{\log[p/(1 - p)] - \log[q/(1 - q)]}{u - v}, \qquad (5)$$

which indicates that an item with a very low discrimination parameter may not anchor at
all. Given $a_i > 0$, the intercept parameter $\beta_i$ must be between

$$a_i v - \log[q/(1 - q)]$$

and

$$a_i u - \log[p/(1 - p)],$$

so that the difficulty parameter must be between

$$v - a_i^{-1} \log[q/(1 - q)]$$

and

$$u - a_i^{-1} \log[p/(1 - p)].$$

Suppose that the 2PL model is a reasonable approximation to the data and the item
discrimination parameter $a_i$ is sufficiently large that (5) holds. For example, if $u = 0.5$,
$v = 0$, $p = 0.6$, and $q = 0.4$, then the discrimination must be at least 1.6. In addition,
unless $a_i$ is somewhat larger than 1.6, the interval for the item difficulty will be very
narrow. If scale anchoring is informative for this data set, that means that this item and
a few other items that anchor at $\theta_s = u$ possess a few specific item attributes. That
in turn implies that these item attributes determine the above mentioned bounds on $a_i$
and $b_i$, or, in other words, that the item attributes predict item discrimination and item
difficulty. On the other hand, if the item attributes predict item discrimination and item
difficulty adequately, the above mentioned bounds will be associated with a few specific
item attributes; these item attributes are then associated with $\theta_s = u$, which means that
scale anchoring is informative for this data set. Thus, a necessary and sufficient condition
for scale anchoring to be informative is that item attributes predict item discrimination
and item difficulty adequately. In typical cases, adequacy involves item difficulty more than
item discrimination, for the requirement on item discrimination involves sufficiently high
discrimination while the requirement on difficulty involves falling within the proper range.
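
The numerical example in this argument can be checked with a short sketch. It computes the lower bound on the discrimination and the resulting interval for the difficulty, using the values u = 0.5, v = 0, p = 0.6, and q = 0.4 from the text. The function names and the two trial discriminations (2.0 and 1.0) are assumptions made only for illustration.

```python
import math

def logit(prob):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(prob / (1.0 - prob))

def min_discrimination(u, v, p, q):
    """Smallest discrimination for which an item can anchor at u relative to v < u."""
    return (logit(p) - logit(q)) / (u - v)

def difficulty_interval(a, u, v, p, q):
    """Interval of difficulties compatible with anchoring, or None if a is too small."""
    lower = v - logit(q) / a
    upper = u - logit(p) / a
    return (lower, upper) if lower <= upper else None

# Values from the example in the text.
u, v, p, q = 0.5, 0.0, 0.6, 0.4
a_min = min_discrimination(u, v, p, q)

# Hypothetical discriminations above and below the bound.
interval_at_2 = difficulty_interval(2.0, u, v, p, q)
interval_too_flat = difficulty_interval(1.0, u, v, p, q)
```

With these inputs `a_min` is roughly 1.62, the interval for a discrimination of 2.0 is about (0.20, 0.30), and a discrimination of 1.0 yields no admissible difficulty at all.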

When scores are obtained in terms of number of items correct, one may still
compare distribution functions of $\theta_s$ and the number of items correct in order to obtain the
appropriate analysis in terms of the $\theta_s$ parameter.

There has been substantial research on the prediction of item discrimination and
item difficulty from item attributes. Simple regression models and tree-based regression 
models have been applied to examine prediction of item difficulty from item attributes for 
several tests such as Praxis™, GRE®, and NAEP reading (e.g., Sheehan & Mislevy, 1994; 
Sheehan, Kostin, & Persky, 2006; Wainer, Sheehan, & Wang, 1998). These studies show 
a low to moderate amount of success in predicting item difficulty from item attributes. For
example, Sheehan and Mislevy (1994) reported that item attributes explained between 20 
and 40% of the variance in item difficulty and between 4 and 14% of the variance in item 
discrimination for 510 pretest items from a Praxis I® test that measures mathematics, 
reading, and writing; Sheehan et al. (2006) reported that item attributes explained between 
14 and 50% of the variance in item difficulty for NAEP Reading. Nonetheless, it should 
be emphasized that the reported values were not examined by either cross-validation or 
by rigorous statistical analysis designed to adjust for the effects of selection bias. In this 
report, in addition to the IRT analysis, conventional regression analysis is performed to 
predict basic item statistics. 



2 Application 

Data From a Scale Anchoring Study 

A scale anchoring study was recently performed using four forms of a teacher 
licensing test in mathematics. The least and largest possible scaled scores for the test are 
150 and 190. The four anchor levels considered were 150 to 168, 169 to 173, 174 to 178, and 
179 to 190. The score 169 is the least passing score among the states that use the test, 178 
is the largest passing score, and 173 and 174 lie approximately midway between the least
and largest passing scores. 

For the anchor level i, i = 2, 3, 4, an item anchored if:

• At least 65% of examinees scoring in the range defined by the anchor level i answered
the item correctly.

• At most, 50% of examinees scoring in the range defined by the anchor level i − 1
answered the item correctly.

Because the above criteria led to few items being anchored, items that meet a less 
stringent set of criteria were also identified. The criteria to identify items that almost 
anchored were the following: 

• At most, 60% of examinees scoring in the range defined by the anchor level i — 1 
answered the item correctly. 

• The difference between the percentage of examinees in the range defined by anchor 
level i that answered the item correctly and the percentage of examinees in the range 
defined by anchor level i — 1 that answered the item correctly is at least 15%. 

To further supplement the pool of items, those that met only the criterion that at least
65% of the students answered correctly (regardless of the performance of examinees at the
next lower level) were identified. The three categories of items, shown in Table 1, ensure
that there were enough items available to inform the descriptions of examinee achievement 
at the anchor levels. 
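
The three sets of criteria above can be sketched as a small classification routine. This is only an illustration: the function name, the order in which the categories are checked, the label for unused items, and the example percentages are assumptions, not details reported for the study.

```python
def classify_item(pct_level, pct_lower):
    """Classify one item at anchor level i.

    pct_level: percentage correct among examinees in anchor level i.
    pct_lower: percentage correct among examinees in anchor level i - 1.
    Categories are checked from the most to the least stringent (an assumption).
    """
    if pct_level >= 65 and pct_lower <= 50:
        return "anchored"
    if pct_lower <= 60 and pct_level - pct_lower >= 15:
        return "almost anchored"
    if pct_level >= 65:
        return "met the 65% criterion"
    return "not used"

# Hypothetical percentages correct (level i, level i - 1) for four items.
examples = {
    "item A": (72, 41),
    "item B": (62, 45),
    "item C": (80, 75),
    "item D": (40, 30),
}
labels = {name: classify_item(*pcts) for name, pcts in examples.items()}
```

Percentages are kept as whole numbers here so that comparisons at the 15-point boundary are not affected by floating-point rounding.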

The next step was the consensus component where the subject-area experts (that is, 
the test developers) reviewed the items that anchored and tried to interpret the results. 

The outcomes of the scale anchoring procedure were statements such as that the
examinees in Group 2 can (a) order positive integers, (b) follow simple directions (two steps 
or fewer), and so on. The participants of the consensus component of the study found the 
component to be quite tedious and they often struggled to come up with a meaningful list 
of skills at any anchor level. 
Table 1

Number of Items That Anchored

Anchor level   Anchored   Almost anchored   Met the 65% criterion   Total
2                  7             8                   14               29
3                  2             6                   13               21
4                 25            22                   10               57
Total             34            36                   37              107

Note. The total number of items in the four forms is 160.



Results 

Test developers classify each item in the test into one of two classifications (referred 
to as IT) based on item type (pure or real) and one of five classifications (IC) based on 
item content (algebra, data analysis and probability, geometry, measurement, numbers and 
operations). These classifications, along with several other classifications, are used by the 
test developers to assemble test forms that conform to specifications. We had the IT and 
IC classifications available for all items in Forms 1 to 4. In addition, for only one of the four 
test forms (referred to as Form 1), we obtained a table that shows a list of 63 attributes (for 
example, one attribute is whether the item has a stimulus such as a table/figure or not) 
and the attributes (out of these 63) that apply to each item — the content experts created 
this table during the scale anchoring procedure. 

Results from the fitting of simple regression models. We fitted the 2PL
model to data from Forms 1 to 4. Then, for each form, we used a simple linear regression
model to predict the 40 estimated item difficulty parameters $b_i$ and the estimated item
discrimination parameters $a_i$ from indicators of the IT and IC classifications. To avoid
linear dependence of indicator variables, only five of the seven indicator variables plus a
constant predictor can be employed. The regression model performed quite poorly. The
F statistics provided no indication that any relationship between the dependent variables
and the indicator variables existed. The squared multiple correlation coefficient $R^2$ ranged
between 0.05 and 0.16 for the model predicting estimated item difficulty, and between
0.03 and 0.30 for the model predicting estimated item discrimination. Similar results
are obtained if, instead of the estimated difficulty and discrimination parameters, item
proportions correct and item r-biserial correlations are used as the response variables in
the regressions. Note that, if IT and IC classification have no effect on the dependent
variable and if the dependent variable is normally distributed, then the $R^2$ statistic has a
mean of 5/39 = 0.13 and a standard deviation of

$$\left\{ \frac{(5/2)[(39 - 5)/2]}{(39/2)^2 (1 + 39/2)} \right\}^{1/2} = 0.07,$$

the probability is 0.95 that $R^2$ is no greater than 0.27, and the probability is 0.99 that
$R^2$ is no greater than 0.35 (Rao, 1973, chapter 3). Thus no evidence exists that the
IT and IC classifications are useful in predicting the four item statistics estimated item
difficulty, estimated item discrimination, proportion correct, and r-biserial correlation.
This conclusion reflects two considerations. An $R^2$ of 0.3 or less does not indicate much
ability to predict an item attribute. In addition, in view of the eight $R^2$ statistics examined,
the fact that the largest is about 0.30 provides no clear evidence that any relationship at
all exists between item difficulty and item discrimination on the one hand and the IC and
IT attributes on the other hand.
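
The null-distribution figures quoted above can be reproduced approximately with a short sketch. It assumes the standard result that, with 40 items, 5 predictors plus a constant, and a normally distributed dependent variable unrelated to the predictors, the R² statistic follows a beta distribution with parameters 5/2 and 34/2. The numerical integration routine and all names are illustrative choices; only the Python standard library is used.

```python
import math

def beta_pdf(x, alpha, beta):
    """Density of the beta distribution; returns 0 outside the open interval (0, 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1.0) * math.log(x) + (beta - 1.0) * math.log(1.0 - x))

def beta_cdf(x, alpha, beta, steps=2000):
    """Approximate distribution function by composite Simpson's rule (steps must be even)."""
    h = x / steps
    total = beta_pdf(0.0, alpha, beta) + beta_pdf(x, alpha, beta)
    for j in range(1, steps):
        total += (4.0 if j % 2 else 2.0) * beta_pdf(j * h, alpha, beta)
    return total * h / 3.0

n_items, n_predictors = 40, 5
alpha = n_predictors / 2.0                  # 5/2
beta = (n_items - 1 - n_predictors) / 2.0   # 34/2

null_mean = alpha / (alpha + beta)
null_sd = math.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0)))
prob_below_027 = beta_cdf(0.27, alpha, beta)
prob_below_035 = beta_cdf(0.35, alpha, beta)
```

Under these assumptions the mean and standard deviation round to 0.13 and 0.07, and the two probabilities come out close to 0.95 and 0.99.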

For Form 1, we performed a stepwise linear regression (Draper & Smith, 1998, 
chapter 15) to predict the estimated item difficulty parameters and the estimated item 
discrimination parameters from the indicators of the 63 item attributes. The trivial 
indicator function with value 1 for all items was always included. Variables were added one 
by one to the model only if the F statistic for a variable was significant at the 0.15 level 
(the default value in SAS version 9.2 for stepwise linear regression). The same criterion 
was used for removal of variables. At first glance, the results might appear more promising 
than for the regressions on IT and IC classification. The algorithm picked six nontrivial 
attributes out of the possible 63 in predicting the estimated item difficulty parameters, and 
the resulting $R^2$ statistic was 0.43. In the case of item slope parameters, eight nontrivial
item attributes were chosen, and the resulting $R^2$ was 0.64. Only one nontrivial item
attribute was included in both the final model for item discrimination and the final model 
for item difficulty. Nevertheless, cross-validation shows that the apparently high $R^2$ values
are a deceptive artifact of the fact that the stepwise regression procedure, when applied
with a level of $\alpha$, has an actual level that is much larger than $\alpha$ and tends to admit more
predictors than is appropriate; see, for example, Draper and Smith (1998, pp. 342-343). To
examine this issue, a cross-validation procedure was employed in which a series of stepwise
regressions were performed, each with one item $i$ removed. The regression without Item $i$
was then used to obtain a prediction $\hat{Y}_i$ of the value $Y_i$ of the dependent variable for Item $i$,
where $Y_i$ is either $a_i$ or $b_i$. The estimated mean-squared error was given by

$$\hat{\sigma}^2_e = m^{-1} \sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2.$$

This mean-squared error was then compared to the estimated mean-squared error obtained
from the same cross-validation procedure by prediction of $Y_i$ by the arithmetic mean $\bar{Y}_i$ of
the observations $Y_j$, $j \ne i$. This mean-squared error is

$$\hat{\sigma}^2_0 = m^{-1} \sum_{i=1}^{m} (Y_i - \bar{Y}_i)^2 = [m/(m - 1)]s^2,$$

where $s$ is the sample standard deviation of $Y_i$, $1 \le i \le m$ (Haberman & Sinharay, 2008).
The proportional reduction of mean-squared error from use of the stepwise regression rather
than a constant predictor is then

$$R^2 = 1 - \hat{\sigma}^2_e / \hat{\sigma}^2_0.$$

The observed values of $R^2$ were $-3.70$ for item difficulty and $-2.16$ for item discrimination,
so that the results of the stepwise regression could reasonably be regarded as much worse 
than useless. An alternative approach to stepwise regression can be adopted with a much 
stricter criterion for entry and removal of variables based on the Bonferroni inequality. To 
ensure that the probability is no greater than 0.15 that a variable will be entered at all if 
the dependent variable is independent of the independent variables and the dependent
variable has a normal distribution, one requires a significance level of 0.15/63 = 0.00238 
(Draper & Smith, 1998, p. 142). When a level of 0.00238 was used, no indicators of item 
attributes were entered at all for either item discrimination or item difficulty. Note that 
were tree regression applied in this example and were the Bonferroni approach used, it is
also true that no variables would be entered, so that no tree construction would occur. 
Criteria for tree branching comparable to those for stepwise regression would encounter the 
same problems of cross-validation found with stepwise regression. 
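
The leave-one-out comparison described above can be sketched as follows. The sketch does not implement stepwise regression; as a stand-in it uses a simple predictor based on a single 0/1 attribute, and the data values and function names are hypothetical. It also checks the identity that the leave-one-out mean-squared error of the mean predictor equals m/(m − 1) times the sample variance.

```python
from statistics import mean, variance

def loo_mse(y, predict_held_out):
    """Mean squared leave-one-out prediction error.

    predict_held_out(i, train) must return a prediction for y[i]
    computed only from the items whose indices are listed in train.
    """
    m = len(y)
    squared_errors = []
    for i in range(m):
        train = [j for j in range(m) if j != i]
        squared_errors.append((y[i] - predict_held_out(i, train)) ** 2)
    return sum(squared_errors) / m

def loo_r2(y, predict_held_out):
    """Proportional reduction in leave-one-out MSE relative to the mean predictor."""
    baseline = loo_mse(y, lambda i, train: mean(y[j] for j in train))
    return 1.0 - loo_mse(y, predict_held_out) / baseline

def group_mean_predictor(y, x):
    """Stand-in for a fitted regression: mean of training items sharing x[i]."""
    def predict(i, train):
        same = [y[j] for j in train if x[j] == x[i]]
        return mean(same) if same else mean(y[j] for j in train)
    return predict

# Hypothetical item statistics and one hypothetical 0/1 item attribute.
y = [0.2, -0.4, 1.1, 0.7, -0.9, 0.3, -0.1, 0.5]
x = [1, 0, 1, 0, 1, 0, 1, 0]

baseline_mse = loo_mse(y, lambda i, train: mean(y[j] for j in train))
r2_mean_only = loo_r2(y, lambda i, train: mean(y[j] for j in train))
r2_attribute = loo_r2(y, group_mean_predictor(y, x))
```

A negative value of `r2_attribute` would indicate that the attribute-based predictor does worse out of sample than simply predicting the mean, which is the situation reported for the stepwise regressions.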

Results from the fitting of IRT models. We fitted Models $M_I$, $M_0$, $M_1$, $M_2$,
$M_L$, $M_C$, and $M_A$ to Forms 1 to 4. Results for $M_A$ are essentially the same as for $M_C$, so
that they are not reported. In the case of the LLTM ($M_L$) and the constrained 2PL model
($M_C$), a model based on the six linearly independent IC and IT indicators was employed for
all four forms. In addition, for Form 1, models $M_L$ and $M_C$ were applied with 14 indicator
functions. One indicator was 1 for all items, and the other indicator functions were those 
used in the final model from either the stepwise regression for item difficulty or the stepwise 
regression for item discrimination. 

Table 2 shows the values of MEELPPI for Form 1. Each row corresponds to a 
model. The table shows, for each model, the following quantities: 

• The number of parameters 

• MEELPPI 

• The correlation between the proportion correct p+ and the estimated difficulty from
the model (denoted as Cor(b, p+) in the table)

• (For only the LLTM and constrained 2PL model.) The correlation between the estimated
difficulty from the model and the estimated difficulty from the corresponding
unrestricted model (which is the Rasch model for the LLTM and the 2PL model for
the constrained 2PL model). The correlation is denoted as Cor(b, b).

• The correlation between the item r-biserial coefficient R_bis and the estimated
discrimination from the model (Cor(a, R_bis))

• (For only the constrained 2PL model.) The correlation between the estimated
discrimination from the model and the estimated discrimination from the 2PL model
(Cor(a, a_2))
Table 2

Minimum Estimated Expected Log Penalty per Item (MEELPPI) for the
Different Models for Form 1

Model            Number of    MEELPPI   Cor(b, p+)   Cor(b, b)   Cor(a, R_bis)   Cor(a, a_2)
                 parameters
0PL                   2        0.6328
Independence         40        0.5927
Rasch                41        0.5366     -0.98
LLTM-IT&IC            7        0.6267     -0.26        0.26
LLTM-Stepwise        15        0.5851     -0.73        0.70
2PL                  80        0.5313     -0.99                      0.92
2PL-C-IT&IC          12        0.6256     -0.25        0.26          0.17            0.04
2PL-C-Stepwise       28        0.5746     -0.50        0.49          0.35            0.33

It should be noted that regression results on item intercepts and item difficulties are 
comparable, so that the results for item intercepts and for the alternative constrained 2PL 
model are not reported. 

Interpretation of Table 2 is straightforward, except for the models based on item 
attributes from stepwise regression. The 2PL model is a bit more successful than is 
the Rasch model, but the difference is small. The $R^2_{012}$ statistic is 0.95, so that the 
preponderance of the improvement in MEELPPI from the 0PL to the 2PL model is 
obtained from the transition from the 0PL to the Rasch model. This result and the 
observed differences in MEELPPI are relatively common in educational tests (Haberman, 
2005, 2007). Note that $d_2 - d_1 = d_1 - d_0 = 39$, so that the improvement in MEELPPI per 
independent parameter is $u_{01} = 0.0025$ for the comparison of the 0PL and Rasch models and 
$u_{12} = 0.0001$ for the comparison of the Rasch and 2PL models. The LLTM based on the IC 
and IT attributes is relatively unsuccessful. The $R^2_{0L1}$ statistic is only 0.06, so that relatively 
little of the improvement from the 0PL to the Rasch model is explained by the LLTM. In 
addition, $u_{0L} = 0.0012$ is a somewhat smaller improvement of MEELPPI per independent 
parameter for the comparison of the 0PL model and LLTM than the corresponding value 
$u_{L1} = 0.0027$ from comparison of the LLTM to the Rasch model. Similar comments apply 
to the constrained 2PL model based on the IC and IT classifications. A notable feature is 
that the LLTM and constrained 2PL model are both less successful than the independence 
model. 
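
The statistics quoted for the 0PL, Rasch, 2PL, and LLTM-IT&IC models can be checked directly against the Table 2 entries. The sketch below assumes, consistently with the reported values, that the $R^2$ statistic is the share of the MEELPPI improvement from a base model to a full model that is already achieved by an intermediate model, and that $u$ is the drop in MEELPPI divided by the number of added independent parameters. The function and variable names are illustrative and do not come from the paper.

```python
# (number of parameters, MEELPPI) copied from Table 2 for Form 1.
TABLE2 = {
    "0PL": (2, 0.6328),
    "Rasch": (41, 0.5366),
    "2PL": (80, 0.5313),
    "LLTM-IT&IC": (7, 0.6267),
}

def prop_reduction(base, mid, full, table=TABLE2):
    """Assumed definition: (M_base - M_mid) / (M_base - M_full)."""
    m_base, m_mid, m_full = (table[name][1] for name in (base, mid, full))
    return (m_base - m_mid) / (m_base - m_full)

def gain_per_parameter(small, large, table=TABLE2):
    """Assumed definition: MEELPPI improvement per added independent parameter."""
    (d_small, m_small), (d_large, m_large) = table[small], table[large]
    return (m_small - m_large) / (d_large - d_small)

# (label, recomputed value, value reported in the text)
checks = [
    ("R2_012", prop_reduction("0PL", "Rasch", "2PL"), 0.95),
    ("u_01", gain_per_parameter("0PL", "Rasch"), 0.0025),
    ("u_12", gain_per_parameter("Rasch", "2PL"), 0.0001),
    ("R2_0L1", prop_reduction("0PL", "LLTM-IT&IC", "Rasch"), 0.06),
    ("u_0L", gain_per_parameter("0PL", "LLTM-IT&IC"), 0.0012),
    ("u_L1", gain_per_parameter("LLTM-IT&IC", "Rasch"), 0.0027),
]
for label, computed, reported in checks:
    print(f"{label}: computed {computed:.5f}, reported {reported}")
```

Because the table entries are themselves rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.
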

The LLTM based on the item attributes from stepwise regression is not very 
successful, but it appears more successful than the LLTM based on the IC and IT 
attributes. The $R^2_{0L1}$ statistic is 0.50, $u_{0L} = 0.0037$, and $u_{L1} = 0.0019$, so that the LLTM 
is substantially less effective than the full Rasch model, but it does reduce MEELPPI per 
independent parameter compared to the 0PL model somewhat better than in the case of 
the LLTM based on IC and IT. Results for the constrained 2PL case are somewhat similar. 
Nonetheless, a substantial selection bias is involved due to the choice of item attributes by 
stepwise regression. To check this issue, 20 additional LLTMs were considered in which 
one item attribute indicator was 1 for each item and 13 item attribute indicators were 
selected at random from the 63 available indicators for item attributes. The additional 
restriction was imposed that the number of independent parameters be 14. For each 
combination of 13 nontrivial indicators, the MEELPPI was computed along with the $R^2$ 
statistics for prediction of item difficulty from the indicator variables. The sample mean 
of the MEELPPI statistics was 0.6101, and the sample standard deviation was 0.0063. 
The smallest MEELPPI observed from the 20 additional models was 0.5966, and the 
corresponding value of $R^2_{0L1}$ was 0.38, so that the results of stepwise regression were a bit 
better than those typically derived by a random use of a comparable number of indicator 
variables for item attributes. On the other hand, some reason still exists for concern about 
the reality of even the modest result achieved from the stepwise regression. A regression of 
MEELPPI on $R^2$ for the 20 models yields an estimated regression line of $0.6305 - 0.0820R^2$ 
for estimation of MEELPPI. The corresponding coefficient of determination is 0.88. The $R^2$ 
for the 13 nontrivial item attributes from stepwise regression is 0.52, so that the regression 
predicts an MEELPPI of 0.5880, a close approximation to the observed 0.5851. Thus it is 
quite plausible that the results based on stepwise regression merely reflect the tendency of 
the stepwise regression procedure to admit more predictors than is appropriate. Similar 
remarks also apply to the constrained 2PL case. 
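
The selection-bias check reduces to evaluating the fitted line at the stepwise $R^2$ and comparing the prediction with the observed MEELPPI. A minimal sketch follows; the variable names are illustrative, and every number is taken from the text or from Table 2.

```python
# Reported regression of MEELPPI on R^2 across the 20 random-attribute LLTMs.
INTERCEPT = 0.6305
SLOPE = -0.0820

def predicted_meelppi(r_squared):
    """MEELPPI predicted by the reported regression line."""
    return INTERCEPT + SLOPE * r_squared

stepwise_r2 = 0.52   # R^2 for the 13 nontrivial stepwise-selected attributes
observed = 0.5851    # MEELPPI of the LLTM-Stepwise model in Table 2

predicted = predicted_meelppi(stepwise_r2)
residual = observed - predicted
print(f"predicted = {predicted:.4f}, observed = {observed:.4f}, "
      f"residual = {residual:+.4f}")
```

With the rounded published coefficients the prediction is about 0.588, in line with the reported 0.5880. The small residual is what suggests that the stepwise result is roughly what a model with a comparable $R^2$ from randomly chosen attributes would produce.
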

Table 3 provides an analysis for Form 2 that is quite comparable to the analysis for 
Form 1, except that the 63 item attributes were not available. The results for Forms 3 and 
4 are similar to those for Form 2 and are not shown here. The MEELPPI values for the LLTM 
and the constrained 2PL model are much larger than those for the Rasch and 2PL models, 
respectively, and are even larger than that for the independence model. These results, like 
the regression results above, show that IT and IC classifications are poor predictors of 
either item difficulty or item discrimination. 



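Table 3 

Minimum Estimated Expected Log Penalty per Item (MEELPPI) for the Different Models for Form 2 

| Model        | Number of parameters | MEELPPI | Cor(b, p+) | Cor(b, b) | Cor(a, R_bis) | Cor(a, a_2) |
|--------------|----------------------|---------|------------|-----------|---------------|-------------|
| 0PL          | 2                    | 0.6282  |            |           |               |             |
| Independence | 40                   | 0.5741  |            |           |               |             |
| Rasch        | 41                   | 0.5175  | -0.98      |           |               |             |
| LLTM-IT&IC   | 7                    | 0.6161  | -0.34      | 0.33      |               |             |
| 2PL          | 80                   | 0.5137  | -0.98      |           | 0.86          |             |
| 2PL-C-IT&IC  | 12                   | 0.6159  | -0.34      | 0.36      | 0.03          | 0.00        |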

It is reasonable to conclude that the available item attributes for the four forms 
provide no basis for scale anchoring. It is no wonder then that the consensus component of 
the scale anchoring process was found tedious by the participants. 

Conclusions 

This paper describes a set of simple statistical and psychometric techniques that 
can be used to examine whether a scale anchoring study will yield useful information. The 
techniques involve fitting simple linear regression and IRT models to examine whether 
appropriate item attributes can predict item difficulty and item discrimination. The 
application of the techniques to four forms of a teacher licensing examination shows that the 
item attributes do not predict item difficulty and item discrimination adequately for 
these data. Scale anchoring is therefore not expected to provide much useful information to the 
examinees for this test. 

The discouraging results for the example considered do not necessarily imply that 
the same results will always be observed, but they certainly indicate that success in scale 
anchoring is far from guaranteed. Presumably the adequacy of the list of item attributes 
possessed by the items is key to the suggested techniques. Such a list can be 
found in the test blueprint used by the test developers to build test forms, or it 
can be produced from scale anchoring of another form of the same test or of a similar test. It 
is possible that our suggested techniques, performed with a set of available attributes, show 
that a scale anchoring study will fail to elicit useful information but that later, in a scale 
anchoring study, the content experts come up with a different list of item attributes to 
describe the anchor levels. However, in our opinion, this situation will mostly occur for 
tests in which the test construction process is not very rigorous, so that test forms are 
created without careful attention to item attributes. Note that if a testing program intends 
to report PLDs, several researchers such as Bejar, Braun, and Tannenbaum (2007) have 
argued that the descriptors should be written early in the test development process and be 
used in developing test blueprints and item specifications. If that is done, the methodology 
suggested in this paper can be used in the initial stages of a test construction, probably 
after a trial administration and before an operational administration. Attempts to report 
PLDs from a test that was not built to do so usually will not result in much useful 
information. 

A further issue is the importance of sample size. Statistical procedures are far more 
likely to lead to satisfactory results with larger collections of items. Longer tests are thus 
more attractive targets. In addition, it is reasonable to consider multiple forms, although 
such a study has to ensure that the item difficulty and item discrimination parameters of 
the different forms are comparable to each other. 